
Generative Verifiers: Reward Modeling as Next-Token Prediction

🌈 Abstract

The paper proposes Generative Verifiers (GenRM), which recast verification as next-token prediction in large language model (LLM) reasoning domains. Key points:

  • GenRM is a more performant alternative to discriminative reward models and unlocks inference-time techniques such as chain-of-thought reasoning and majority voting for better verification.
  • GenRM unifies generation and verification into a single LLM, and demonstrates that such unification benefits both generation and verification.
  • GenRM can effectively utilize synthetic model-generated rationales, which are noisy and sub-optimal, to identify reasoning errors in grade school math problems.

🙋 Q&A

[01] Comparing GenRM with Prior Verification Approaches

1. How does GenRM compare to standard discriminative verifiers and other approaches on reasoning tasks?

  • GenRM, which directly predicts a Yes/No token to verify a solution, matches or outperforms the discriminative reward model (RM) and other approaches such as LLM-as-a-Judge and self-consistency on algorithmic tasks like Last Letter Concatenation and Word Sorting, as well as on the GSM8K math reasoning task (a minimal scoring sketch follows this list).
  • GenRM-CoT, which combines chain-of-thought reasoning with majority voting, further improves over direct GenRM.
  • On GSM8K, GenRM-CoT consistently outperforms all other methods, even when using model-generated (rather than human-written) verification rationales.
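For concreteness, here is a minimal sketch of the direct GenRM scoring rule, assuming a Hugging Face causal LM; the checkpoint name, prompt wording, and token handling below are illustrative choices, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the paper finetunes Gemma-family models as verifiers.
MODEL_NAME = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def genrm_score(question: str, solution: str) -> float:
    """Direct GenRM: score a candidate solution as the probability that the
    verifier's next token is 'Yes' rather than 'No'."""
    prompt = (
        f"Question: {question}\n"
        f"Solution: {solution}\n"
        "Is the answer correct (Yes/No)? "
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    # The exact token ids for 'Yes'/'No' depend on the tokenizer (e.g. leading space).
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    no_id = tokenizer.convert_tokens_to_ids("No")
    # Normalize over the two verdict tokens so the score lies in [0, 1].
    return float(probs[yes_id] / (probs[yes_id] + probs[no_id]))
```

A score above 0.5 can be read as a "correct" verdict; for reranking candidate solutions, the raw score is used directly.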

2. How does GenRM's use of chain-of-thought reasoning and majority voting impact its performance?

  • With oracle verification CoTs, GenRM-CoT closely matches the performance of an oracle verifier on the algorithmic tasks.
  • On GSM8K, GenRM-CoT is able to detect subtle reasoning errors that are missed by discriminative verifiers, by leveraging the chain-of-thought rationales.
  • Majority voting across multiple CoT rationales generated by GenRM-CoT further boosts its accuracy, allowing it to nearly match an oracle verifier on the algorithmic tasks (see the voting sketch after this list).
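Below is a hedged sketch of the CoT variant with majority voting, reusing the `model` and `tokenizer` objects from the previous sketch; the prompt wording, sampling temperature, and default number of votes are assumptions.

```python
def genrm_cot_score(question: str, solution: str, num_votes: int = 4) -> float:
    """GenRM-CoT with majority voting: sample several verification rationales,
    score p('Yes') after each one, and average the scores."""
    prompt = (
        f"Question: {question}\n"
        f"Solution: {solution}\n"
        "Let's verify the solution step by step.\n"
    )
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    no_id = tokenizer.convert_tokens_to_ids("No")
    scores = []
    for _ in range(num_votes):
        # Sample one chain-of-thought verification rationale.
        inputs = tokenizer(prompt, return_tensors="pt")
        generated = model.generate(
            **inputs, do_sample=True, temperature=0.7, max_new_tokens=256
        )
        rationale = tokenizer.decode(generated[0], skip_special_tokens=True)
        # Re-score the Yes/No verdict conditioned on the sampled rationale.
        verdict_prompt = rationale + "\nIs the answer correct (Yes/No)? "
        verdict_inputs = tokenizer(verdict_prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**verdict_inputs).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        scores.append(float(probs[yes_id] / (probs[yes_id] + probs[no_id])))
    # Majority voting amounts to averaging the per-rationale 'Yes' probabilities.
    return sum(scores) / len(scores)
```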

[02] Unifying Generation and Verification

1. How does unifying solution generation with verification impact GenRM's performance?

  • Unifying solution generation with verification, as GenRM does through a single next-token-prediction objective, consistently improves verification performance across all tasks compared to training GenRM on verification data alone (a schematic of the unified objective follows this list).
  • Conversely, adding CoT verification data to the training mix also improves the solution-generation performance of the resulting GenRM-CoT model itself.
  • This suggests that teaching the verifier to imitate correct solutions through next-token prediction is mutually beneficial for both generation and verification.
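Schematically, the unified objective is plain next-token prediction on verification targets plus a weighted next-token-prediction term on known-correct solutions; the notation below, including the mixture weight lambda, is ours and is only meant to convey the idea, not the paper's exact formulation.

```latex
\mathcal{L}(\theta) =
  \underbrace{\mathbb{E}_{(x,\,v)\sim\mathcal{D}_{\mathrm{verify}}}
    \bigl[-\log p_\theta(v \mid x)\bigr]}_{\text{verification: Yes/No, optionally preceded by a CoT rationale}}
  \;+\; \lambda\,
  \underbrace{\mathbb{E}_{(q,\,s^{+})\sim\mathcal{D}_{\mathrm{correct}}}
    \bigl[-\log p_\theta(s^{+} \mid q)\bigr]}_{\text{generation: SFT on correct solutions}}
```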

[03] Scaling Data, Model Size, and Inference-time Compute

1. How does GenRM-CoT's performance scale with increased inference-time compute?

  • GenRM-CoT's performance scales gracefully with the number of CoT rationales sampled for majority voting, surpassing greedy decoding with as few as 4 votes.
  • Across different Gemma model scales (2B, 7B, 9B), the finetuned GenRM-CoT verifier outperforms the LLM-as-a-Judge approach, which also utilizes CoT and majority voting but with a more capable Gemini 1.0 Pro model.

2. How does GenRM's performance scale with increasing model size and training data?

  • The performance of GenRM and GenRM-CoT verifiers scales positively with an increase in Gemma model capacity, matching the expectation that larger models can learn more from the same data under the next-token prediction loss.
  • For GenRM-CoT on GSM8K, training on multiple rationales per solution has a substantial positive effect on both RM accuracy and Best-of-N performance, suggesting the model benefits from an "ensembling" effect when trained on noisy synthetic rationales (a Best-of-N selection sketch follows this list).
  • Direct GenRM verifiers trained only on verification data still outperform standard discriminative RMs as the amount of training data increases, demonstrating the effectiveness of casting verification as a next-token prediction problem.
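To connect the verifier to the Best-of-N results above, here is a minimal selection sketch; `score_fn` can be either scoring function sketched earlier, and the candidate solutions are assumed to come from a separately sampled generator.

```python
def best_of_n(question: str, candidates: list[str], score_fn) -> str:
    """Best-of-N selection: score every sampled candidate solution with the
    verifier and return the highest-scoring one."""
    scores = [score_fn(question, solution) for solution in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]
```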

[04] Impact of Synthetic Rationale Quality

1. How does the quality of synthetic rationales impact GenRM-CoT's performance on GSM8K?

  • Using reference-guided grading to generate the synthetic rationales significantly improves GenRM-CoT's performance on GSM8K compared to using unguided synthetic rationales (an illustrative prompt sketch follows this list).
  • This indicates that LLMs are better able to identify reasoning errors when provided with a reference solution for comparison, even when using the same model (Gemini 1.0 Pro) to generate both the solutions and rationales.
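An illustrative reference-guided grading prompt (the wording is ours, not the paper's exact template): the rationale-generating LLM sees a known-correct reference solution alongside the candidate it must check.

```python
def reference_guided_prompt(question: str, reference_solution: str,
                            candidate_solution: str) -> str:
    """Build a reference-guided grading prompt for synthetic rationale generation."""
    return (
        f"Question: {question}\n\n"
        f"Reference solution (known to be correct):\n{reference_solution}\n\n"
        f"Candidate solution to verify:\n{candidate_solution}\n\n"
        "Compare the candidate to the reference step by step, point out any "
        "reasoning errors in the candidate, and finish with "
        "'Is the answer correct (Yes/No)?' followed by the verdict."
    )
```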
